In this script we are fitting species distribution models to the data of koalas (Phascolarctos cinereus) in the South-East Queensland (SEQ) region under current environmental conditions, which we will predict into future environmental conditions. Our modelling protocol follows these steps: 1. Load the koala presence data and environmental covariates. 2. Sample background points. 3. Select environmental covariates based on expert knowledge and correlation analysis. 4. Exploration of the koala presence and background data. 5. Fit a Generalised Linear Model (GLM) to the data based on current climatic conditions. 6. Fit a Random Forest Model to the data based on current climatic conditions. 7. Evaluate the model performance with spatial block cross validation. 8. Predict the model into future environmental conditions. 9. Summarise model uncertainty.
Up to date references can be downloaded from: https://paperpile.com/eb/EnPQSBmTeb
We wrote this script drawing on some of the following resources:
# general R functionslibrary(tidyverse)library(RColorBrewer)# for downloading species datalibrary(galah)# for general spatial operationslibrary(terra)library(sf)library(tidyterra)# for SDM analyseslibrary(predicts)library(blockCV)library(ecospat)library(usdm)library(precrec)library(corrplot)
Load South East Queensland (SEQ) boundary
We start by defining our study area, which is the South East Queensland (SEQ) region. In a previous session we used the Local Government Areas (LGA) shapefile to define the extent of SEQ.
What we’ll load is a polygon of our boundary which becomes our model domain.
For our coordinate reference system throughout, we’ll be using GDA2020 / MGA zone 56, which is identified by the EPSG code 7856.
Code
SEQ_extent.vect <-vect("Data/Environmental_variables/seq_boundary.gpkg")# Define an sf object as well (used later)SEQ_extent <-st_as_sf(SEQ_extent.vect, coords =c("x", "y"), crs =7856)plot(SEQ_extent.vect)
Step 1. Load the koala presence data and environmental covariates.
First, we’re going to get some koala occurrence data. We will use the galah package to access the Atlas of Living Australia (ALA) data. The ALA is a great resource for Australian biodiversity data, and the galah package provides a convenient interface to access it.
We won’t have time to explore these, but there are plenty of options for those using data outside of Australia. For example, the rgbif package provides access to the Global Biodiversity Information Facility (GBIF) data, and the spocc package provides access to a range of biodiversity data sources.
To cite the galah package
Code
# Package citationcitation(package ="galah")
To cite galah in publications use:
Westgate M, Kellie D, Stevenson M, Newman P (2025). _galah:
Biodiversity Data from the GBIF Node Network_. R package version
2.1.1, <https://CRAN.R-project.org/package=galah>.
A BibTeX entry for LaTeX users is
@Manual{,
title = {galah: Biodiversity Data from the GBIF Node Network},
author = {Martin Westgate and Dax Kellie and Matilda Stevenson and Peggy Newman},
year = {2025},
note = {R package version 2.1.1},
url = {https://CRAN.R-project.org/package=galah},
}
Atlas of Living Australia using “galah” package
To access records from the ALA, you will need to be registered.
There are two registration options:
If you are affiliated with an Australian institution, you should be able to register for ALA through your institution. Find your institution in the list of institutions when you select ‘Login’ on the ALA website. Follow the prompts to enter your institution email and password. If your institution is not listed, you will need to register for an ALA account.
If you are not affiliated with an Australian institution, you will need to register for an ALA account. You can do this by selecting ‘Register’ on the ALA website. Follow the prompts to enter your email and password.
Once you’re registered, you can set up your galah access configuration. This is a one-time setup that allows you to access the ALA data through the galah package. For now, we’ll just use Scott’s email address, but you should use your own email address once you’ve registered for an ALA account.
We’re using a species name call with a state filter to identify all Presence-Only occurrence records of koala for Queensland.
To prevent calling from the ALA every time we run this script, we will save the data to a CSV file. You can uncomment the code below to download the data again if you need to.
Not all data are created equal, and it’s always important to spend some time examining and cleaning your species occurrence data before using it in a model. This is particularly relevant for us because we’re using records collated into a database from a bunch of different sources. We don’t have time to explore this today, but here’s a few examples of things you might check at this stage and some functions / packages available to help:
When downloading data from the ALA or directly from GBIF, you can filter for specific fields. For example, you can filter for records with a specific coordinate precision or quality, or remove those records older than a certain date. You can use the galah_filter() function for this.
Check for duplicates in the data. You can use the dplyr package with functions like distinct().
Common issues like spatial outliers, taxonomic errors or coordinate errors are tested for with packages like: spocc, CoordinateCleaner, and bdc.
Usually you would check that the CRS for the occurrences is the same as the shapefile. However, we know that the `galah’ package operates in WGS84.
Code
# Check for missing values in decimalLongitude and decimalLatitudepaste0("Number of NAs in 'longitude' ", sum(is.na(koala_occurrences$decimalLongitude)))
[1] "Number of NAs in 'longitude' 215"
Code
paste0("Number of NAs in 'latitude' ", sum(is.na(koala_occurrences$decimalLatitude)))
[1] "Number of NAs in 'latitude' 215"
Code
koala_occ_sf <- koala_occurrences %>% tidyr::drop_na(decimalLongitude, decimalLatitude) %>%# remove NA valuesst_as_sf(coords =c("decimalLongitude", "decimalLatitude"), crs =4326) %>%st_transform(7856) # Transform to the same CRS as SEQ_extent
Plot the koala data across Queensland
Code
# Load a shapefile for Queensland just for plottingqld_shp =st_read('Data/Environmental_variables/QLD_State_Mask.shp') %>%st_transform(7856) # Transform to the same CRS as SEQ_extent
Reading layer `QLD_State_Mask' from data source
`/Users/scottforrest/Library/CloudStorage/OneDrive-QueenslandUniversityofTechnology/PhD - Scott Forrest/GIT/ICCB_geospatial_tools_conservation/Session 3/Data/Environmental_variables/QLD_State_Mask.shp'
using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 137.9946 ymin: -29.17927 xmax: 153.5519 ymax: -9.219566
Geodetic CRS: WGS 84
Code
# Create a map using ggplot2ggplot() +geom_sf(data = qld_shp, color ="black") +geom_sf(data = koala_occ_sf,color ="blue", size =0.5) +# Add points for occurrencesggtitle("Koala occurrences across Queensland") +# Add titletheme_bw()
Filter by SEQ region
This takes a minute or two.
Code
koala_occ_sf <- koala_occ_sf %>%st_intersection(SEQ_extent) %>%# Mask to extentdistinct() # drop any duplicated records
Plot the filtered koala data
Code
ggplot() +geom_sf(data = SEQ_extent, fill ="purple3", alpha =0.5, color ="black", size =0.2) +geom_sf(data = koala_occ_sf, # Add koala presence locationsaes(geometry = geometry),color ="blue", size =0.5) +# Add points for occurrencesggtitle("Koala occurrences in South East Queensland") +# Add titletheme_bw()
Step 2. Sample background points
Choosing background points to sample the availability of different environmental conditions is an important step in presence-only modelling. These points are contrasted against environmental conditions where your species was found (the presences) to help the model learn what conditions are suitable for the species. Background selection is a critical step in presence-only SDMs. Choices reflect your understanding of your study species. There’s lots of good discussion about approaches to background selection in the literature, and we recommend reading some of these papers to understand the implications of your choices.
For this tutorial, we will use random sampling of background points across the SEQ region to keep it simple.
A few other approaches include:
Buffering: Create a buffer around the presence points and sample points within that buffer. Figure from Velazco et al..
We need to load a template grid for creating background points. This matches the grid our covariates are on, which we’ll load soon.
We’re using a function from the predicts package for this.
Code
domain <-rast("Data/Environmental_variables/SEQ_current_bioclim.tif")# Make a blank raster of all 1s of the same dimensions as our covariate griddomain <- domain[[1]]domain <-ifel(is.na(domain), NA, 1) # Set all values to 1 except for NA values (which are outside the SEQ extentnames(domain) <-"SEQ_extent"plot(domain, main ="SEQ extent for background sampling")
Code
# Set the location and number of background points# The mask means that any NA (not SEQ) locations do not include background pointsbackground <- predicts::backgroundSample(mask = domain, n =2500)# Convert to terra SpatVector objectbackground <- terra::vect(background[,1:2], crs ="EPSG:7856")# Convert background points (SpatVector) to data framebackground_df <-as.data.frame(geom(background))koala_occ.vect <-vect(koala_occ_sf)# Plot the presences (blue) and the background points (grey)ggplot() +geom_sf(data = SEQ_extent, fill ="purple3", alpha =0.5, color ="black", size =0.2) +geom_spatvector(data = background, # Add koala presence locationscolor ="gray", cex =0.5) +# Add points for occurrencesgeom_spatvector(data = koala_occ.vect, aes(geometry = geometry), color ="blue", cex =0.5) +# Add background pointsggtitle("Koala occurrences (blue) and background points (grey) in South East Queensland") +# Add titletheme_bw()
Combine koala presence and background points as dataframes
Code
# Make a dataframe of just x, y and presencekoala_occ_df <- koala_occ_sf %>% dplyr::mutate(x = sf::st_coordinates(.)[,1],y = sf::st_coordinates(.)[,2]) %>%st_drop_geometry() %>% dplyr::select(x,y) %>%mutate(Presence =1)head(koala_occ_df)
# Combine to one dataframepr_bg <-rbind(koala_occ_df, background_df)
Step 3. Environmental covariate selection
First, we load the rasters describing the current environmental conditions. We did some pre-formatting of these rasters so they match the koala data in projection and extent.
Layers were made available to us by the EcoCommons team and were created by Toombs and Ma (2025):
Toombs, N., and Ma S., 2025, A High-Resolution Dataset of 19 Bioclimatic Indices over Australia, Climate Projections and Services – Queensland Treasury, Brisbane, Queensland. [https://longpaddock.qld.gov.au/qld-future-climate/data-info/tern/]
Code
covs_current <-rast("Data/Environmental_variables/SEQ_current_bioclim.tif")# Define the BIOCLIM names for the raster layerslayer_names <-c("BIO1_Annual_Mean_Temp","BIO2_Mean_Diurnal_Temp_Range","BIO3_Isothermality","BIO4_Temperature_Seasonality","BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO7_Temperature_Annual_Range","BIO8_Mean_Temp_Wettest_Quarter","BIO9_Mean_Temp_Driest_Quarter","BIO10_Mean_Temp_Warmest_Quarter","BIO11_Mean_Temp_Coldest_Quarter","BIO12_Annual_Precipitation","BIO13_Precip_Wettest_Month","BIO14_Precip_Driest_Month","BIO15_Precip_Seasonality","BIO16_Precip_Wettest_Quarter","BIO17_Precip_Driest_Quarter","BIO18_Precip_Warmest_Quarter","BIO19_Precip_Coldest_Quarter")names(covs_current) <- layer_names
One approach: Narrow down potential covariates based on ecological or expert knowledge
For this example, we had advice from scientists at CSIRO who conducted an expert elicitation to gather a set of potential covariates that are likely to be important for koalas.
This is an example of one approach for deciding what covariates are candidates for inclusion in your model. We use this expert knowledge to filter out the key bioclim variables.
We select the following:
Bio5 : Max temp of the warmest month (mainly for the northern populations)
Bio6 : Min temp of the coldest month (mainly for southern populations, which essentially excludes alpine regions)
Bio12 : Annual Precipitation
Bio15 : Precipitation seasonality (coefficient of variation).
Plotting our four covariates from expert elicitation
Code
# select only the covariates we are using by their namescovs_current_expert <-subset(covs_current, names(covs_current) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))for(i in1:nlyr(covs_current_expert)) {plot(covs_current_expert[[i]], main =names(covs_current_expert)[[i]])}
Covariate to account for sampling bias
CHARLOTTE PLEASE CHECK PARAGRAPH, and add any relevant refs
As these data were collected from a wide number of sources, we have to consider whether any of the data, such as that which was opportunistically collected (Kéry et al. 2010), might be biased towards areas of higher human activity. This is a common issue in species occurrence data and is typpically called smapling bias or preferential sampling, and occurs because areas that are more accessible or have higher human activity are more likely to be opportunistically sampled (Conn, Thorson, and Johnson 2017).
This means that if there are environmental variables that happen correlated with human activity (which is often the case, humans also select for environmental features), it will appear that these variables are more influential than they actually are, because we associating that environmental covariate with ‘more’ presence records.
Sampling pseudo-absences where there are more points where there is higher sampling effort
Using a covariate to attempt to capture the sampling bias, such as human population density or distance to roads
Using a model or model structure that explicitly accounts for sampling bias, such as including a spatial random effect
In our case, we have will use a covariate for the Global Human Footprint Index (HFP) , which is a measure of human impact on the environment. This is a raster layer with continuous (percentage values) that we will use to account for sampling bias in our model.
You will be able to see Brisbane as the largest area of human footprint, Moreton Bay to the right, the Gold Coast to the south, the Sunshine Coast to the north, and Toowoomba to the west (about 2 hours drive).
Code
human_footprint <-rast("Data/Environmental_variables/cost_hfp2013.tif")names(human_footprint) <-"human_footprint"# Update CRS to EPSG:7856 (GDA2020 / MGA zone 56)human_footprint <- terra::project(human_footprint, "EPSG:7856")# Resample to match the extent and resolution of the covariateshuman_footprint <-resample(human_footprint, covs_current_expert, method ="bilinear")# Mask to SEQ extenthuman_footprint <- terra::mask(human_footprint, covs_current_expert[[1]]) plot(human_footprint, main ="Human Footprint Index")
Although this isn’t a thorough comparison, we can see that this variable is correlated with the presence of koalas.
Code
plot(human_footprint, main ="Human Footprint Index")points(koala_occ_sf, col ="red", alpha =0.1, pch =20, cex =0.5)
You would want to put some more thought into your own study and the best way to account for sampling bias, but in our case we will try using the human footprint index as a covariate in our model, and assess how our predictions change.
Extract environmental covariate values from presence and background locations (training locations)
This will create a dataframe where each row is a koala presence or background location and each column is a value for our four bioclimatic variables at that location.
We use a function from terra for this.
Code
# this will give us the covariate values for each presence and background pointPB_covs <- terra::extract(x = covs_current_expert_HF, # Raster with environmental variablesy = pr_bg[, c("x", "y")], # Dataframe with x, y and presenceID =FALSE# Don't return an ID column for each location)train_PB_covs <-cbind(pr_bg, PB_covs) # Combine the presence column with the covariate values# drop any NAstrain_PB_covs <- train_PB_covs %>%drop_na()head(train_PB_covs)
Thin the koala presence points (for tutorial only)
We now thin the presences to reduce the number of points to a manageable size for plotting and modelling. This is just for the purpose of this tutorial, and is done here to make the tutorial run faster and to make the plots clearer. There are some reasons you would want to thin (such as bias correction), but you would want to carefully consider whether this is appropriate for your modelling.
We only thin the presence points as the background points are already limited by the number of cells in the SEQ region.
Code
train_PB_covs_pres <- train_PB_covs %>%filter(Presence ==1)train_PB_covs_bg <- train_PB_covs %>%filter(Presence ==0)# Thin the presences - 10000 presence pointstrain_PB_covs_pres_thinned <- train_PB_covs_pres[sample(nrow(train_PB_covs_pres), 10000), ]# Combine back into both presence and backgroundtrain_PB_covs_thinned <-rbind(train_PB_covs_pres_thinned, train_PB_covs_bg, make.row.names =FALSE)
Check correlation and multicollinearity of covariates
Although we’ve narrowed down our covariates already based on discussion with experts, there’s a chance that our remaining variables might be correlated or collinear. This would cause issues with models, particularly because we’re hoping to predict our model into the future. We’re going to inspect our covariates for signs of correlation and multicolinearity.
There are several different methods for creating correlation plots, we just show two.
Correlation plot from the ecospat package
Code
# use the dataframe we just created, dropping the x, y, and presence columns (1-3)ecospat::ecospat.cor.plot(train_PB_covs_thinned[, -1:-3])
What the correlation test plot suggests is that we have some correlated covariates. Particularly our Bio6 and Bio12.
Often a rule of thumb of ~ 0.7 correlation is used to decide what a ‘too correlated’ covariate pair are (REF). In practice, it’s important to think carefully before dropping covariates.
Variance Inflation Factor (VIF)
If you find corrplot is hard for you to use to make decisions, we can also look at the Variance Inflation Factor (VIF). VIF is another statistical measure used to detect multicollinearity in a set of explanatory (independent) variables in a regression model.
Interpretation:
VIF = 1: No correlation
VIF > 1 and <= 5: Moderate correlation; may not require corrective action.
VIF > 5: Indicates high correlation. Multicollinearity may be problematic, and further investigation is recommended.
VIF > 10: Strong multicollinearity. The variable is highly collinear with others, and steps should be taken to address this.
Code
usdm::vifstep(covs_current_expert_HF) # Variance Inflation Factor and test for multicollinearity
No variable from the 5 input variables has collinearity problem.
The linear correlation coefficients ranges between:
min correlation ( BIO15_Precip_Seasonality ~ BIO6_Min_Temp_Coldest_Month ): -0.1420642
max correlation ( BIO12_Annual_Precipitation ~ BIO6_Min_Temp_Coldest_Month ): 0.8762638
---------- VIFs of the remained variables --------
Variables VIF
1 BIO5_Max_Temp_Warmest_Month 2.658798
2 BIO6_Min_Temp_Coldest_Month 5.916991
3 BIO12_Annual_Precipitation 7.552088
4 BIO15_Precip_Seasonality 1.330505
5 human_footprint 1.551557
Step 4. Exploration of the koala presence and background data
It is good practice to assess where in the environmental space the presence and background points are located. This can help to identify if there are any potential issues with the data, such as a lack of background points in certain areas of environmental space. It’s also a good way to have a first look at any patterns in the species data and the environmental covariates.
ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO5_Max_Temp_Warmest_Month"]], fill =as.factor(Presence)),alpha =0.5,bw =0.25) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO5_Max_Temp_Warmest_Month",fill ="Presence") +theme(legend.position ="bottom")
Code
ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO6_Min_Temp_Coldest_Month"]], fill =as.factor(Presence)),alpha =0.5,bw =0.25) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO6_Min_Temp_Coldest_Month",fill ="Presence") +theme(legend.position ="bottom")
Code
ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO12_Annual_Precipitation"]], fill =as.factor(Presence)),alpha =0.5,bw =50) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO12_Annual_Precipitation",fill ="Presence") +theme(legend.position ="bottom")
Code
ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO15_Precip_Seasonality"]], fill =as.factor(Presence)),alpha =0.5,bw =1) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO15_Precip_Seasonality",fill ="Presence") +theme(legend.position ="bottom")
Scaling the covariates
It is common to scale the covariates before fitting a model. This is because some models, such as Generalised Linear Models (GLMs), can be sensitive to the scale of the covariates. Scaling the covariates can also make the coefficients more interpretable comparable, as they are now on similar scales to each other.
A typical scaling approach is to subtract the mean and divide by the standard deviation of each covariate, which is also known as z-scaling, and after this operation, all covariates will have a mean of 0 and a standard deviation of 1.
We can use base R to scale the covariates, and the default is to centre (subtract the mean) and scale (divide by the standard deviation).
Code
# pull out just the covariate values from the dataframecovariate_values <- train_PB_covs_thinned[, -1:-3]# scale the covariates - returns a matrixscaled_covariates <- base::scale(covariate_values, center =TRUE, scale =TRUE)# convert back to a dataframe (and remove row names)scaled_covariates_df <-data.frame(scaled_covariates)head(scaled_covariates_df)
Code
# Add the presence column back to the scaled covariatestrain_PB_covs_thinned_scaled <-cbind(train_PB_covs_thinned[, 1:3], scaled_covariates_df)
The scaling operation also returns the particular scaling values that were used, which is helpful for generating predictions later on. We can access these using the attributes of the returned dataframe. We only need the scaling factor, as we will rescale the coefficients before generating predictions.
Code
# Get the attributes of the scaled covariatesscaled_attributes <-attributes(scaled_covariates)scaled_attributes$`scaled:scale`# the SD/scaling factor of each covariate
Step 5. Fit a model: A generalised linear model (GLM)
Code
# Make a folder to save outputsdir.create("Outputs/GLM_outputs", showWarnings = F)
Null model
Null model: no explanatory variables or predictors are included.
It is always helpful to create a null model as a benchmark to assess how the inclusion of explanatory variables improves the model.
Code
# Fit a null model with only the interceptnull_model <-glm(Presence ~1,data = train_PB_covs_thinned_scaled,family =binomial(link ="logit"))# Check the model resultssummary(null_model)
Call:
glm(formula = Presence ~ 1, family = binomial(link = "logit"),
data = train_PB_covs_thinned_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.10046 0.03028 69.37 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7734 on 11223 degrees of freedom
Residual deviance: 7734 on 11223 degrees of freedom
AIC: 7736
Number of Fisher Scoring iterations: 4
GLM - expert variables with linear terms
In this model, we include linear terms for the covariates. You can also do things like add quadratic terms to account for non-linear relationships between the predictors and the response variable. This increases the complexity of the model and allows for more flexibility in fitting the data.
You can also try fitting the model with and without the scaled covariates. The z- and p-values of each covariate should be the same (but not for the intercept and centering the covariates affects that), as well as the AIC.
Code
glm_model <-glm(Presence ~ BIO5_Max_Temp_Warmest_Month + BIO6_Min_Temp_Coldest_Month + BIO12_Annual_Precipitation + BIO15_Precip_Seasonality + human_footprint,data=train_PB_covs_thinned_scaled,family =binomial(link ="logit"))# Check the model resultssummary(glm_model)
Call:
glm(formula = Presence ~ BIO5_Max_Temp_Warmest_Month + BIO6_Min_Temp_Coldest_Month +
BIO12_Annual_Precipitation + BIO15_Precip_Seasonality + human_footprint,
family = binomial(link = "logit"), data = train_PB_covs_thinned_scaled)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.51987 0.07562 46.549 < 2e-16 ***
BIO5_Max_Temp_Warmest_Month -0.48449 0.05511 -8.792 < 2e-16 ***
BIO6_Min_Temp_Coldest_Month 0.39503 0.07556 5.228 1.71e-07 ***
BIO12_Annual_Precipitation -0.12531 0.07848 -1.597 0.11
BIO15_Precip_Seasonality 0.67846 0.04465 15.196 < 2e-16 ***
human_footprint 1.44056 0.06810 21.154 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 7734.0 on 11223 degrees of freedom
Residual deviance: 4494.9 on 11218 degrees of freedom
AIC: 4506.9
Number of Fisher Scoring iterations: 7
Model effect evaluation
Here we use a function presented in an EcoCommons Australia notebook to evaluate the model performance. The notebook can be found on their GitHub: https://github.com/EcoCommonsAustralia/notebooks/tree/main/notebooks.
Code
# Function to plot effect size graphplot_effect_size <-function(glm_model) {# Check if required libraries are installedif (!requireNamespace("ggplot2", quietly =TRUE)) {stop("Please install the 'ggplot2' package to use this function.") }library(ggplot2)# Extract effect sizes (coefficients) from the model coefs <-summary(glm_model)$coefficients effect_sizes <-data.frame(Variable =rownames(coefs)[-1], # Exclude the interceptEffect_Size = coefs[-1, "Estimate"],Std_Error = coefs[-1, "Std. Error"] )# Sort by effect size effect_sizes <- effect_sizes[order(-abs(effect_sizes$Effect_Size)), ]# Plot the effect sizes with error barsggplot(effect_sizes, aes(x =reorder(Variable, Effect_Size), y = Effect_Size)) +geom_hline(yintercept =0, linetype ="dashed", color ="black", alpha =0.5) +geom_point(stat ="identity", colour ="#11aa96") +geom_errorbar(aes(ymin = Effect_Size - Std_Error, ymax = Effect_Size + Std_Error), colour ="#11aa96", width =0.1) +coord_flip() +labs(title ="Effect Sizes of Variables",x ="Variable",y ="Effect Size (Coefficient Estimate)" ) +theme_bw()}
Run the function on the covariates used in the model.
Interestingly, the human footprint index has the largest effect size, followed by the annual precipitation and the minimum temperature of the coldest month, both of which were positive. The maximum temperature of the warmest month has a negative coefficient, which suggests that koalas are less likely to be found in areas with higher temperatures during the warmest month.
Code
plot_effect_size(glm_model)
Response curves
Again, we can use a function from the EcoCommons notebook to plot the response curves from the model.
Code
plot_species_response <-function(glm_model, predictors, data) {# Check if required libraries are installedif (!requireNamespace("ggplot2", quietly =TRUE) ||!requireNamespace("gridExtra", quietly =TRUE)) {stop("Please install the 'ggplot2' and 'gridExtra' packages to use this function.") }library(ggplot2)library(gridExtra)# Create empty list to store response plots response_plots <-list()# Loop through each predictor variablefor (predictor in predictors) {# Create new data frame to vary predictor while keeping others constant pred_range <-seq(min(data[[predictor]], na.rm =TRUE),max(data[[predictor]], na.rm =TRUE),length.out =100 ) const_data <- data[1, , drop =FALSE] # Use first row to keep other predictors constant response_data <- const_data[rep(1, 100), ] # Duplicate the row response_data[[predictor]] <- pred_range# Predict probabilities predicted_response <-predict(glm_model, newdata = response_data, type ="response")# Create data frame for plotting plot_data <-data.frame(Predictor_Value = pred_range,Predicted_Probability = predicted_response )# Add presence and absence data presence_absence_data <-data.frame(Predictor_Value = data[[predictor]],Presence_Absence = data$Presence )# Generate the response plot p <-ggplot() +geom_line(data = plot_data,aes(x = Predictor_Value, y = Predicted_Probability),color ="#61c6fa", linewidth =1) +geom_point(data = presence_absence_data[presence_absence_data$Presence_Absence ==1, ],aes(x = Predictor_Value, y = Presence_Absence),color ="#11aa96", alpha =0.2) +geom_point(data = presence_absence_data[presence_absence_data$Presence_Absence ==0, ],aes(x = Predictor_Value, y = Presence_Absence),color ="#f6aa70", alpha =0.2) +labs(x = predictor, y =NULL) +theme_bw() +theme(axis.title.y =element_blank())# Store the plot in the list response_plots[[predictor]] <- p }# Arrange all plots in one combined plot with a single shared y-axis labelgrid.arrange(grobs = response_plots,ncol =3,left ="Predicted Probability / Presence-Absence" )}
Plot the response curves
Code
# get the names of the covariates in the model (drop the intercept)predictors <-names(glm_model$coefficients)[-1]# Plot the response curves for the predictors in the modelplot_species_response(glm_model, predictors, train_PB_covs_thinned_scaled)
Scale the raster layers with the same scaling as the covariates
Code
# Scale your prediction rasters using the same scaling parameterscovs_current_expert_HF_scaled <-scale(covs_current_expert_HF, center =attr(scaled_covariates, "scaled:center"),scale =attr(scaled_covariates, "scaled:scale"))# plot to check the scaling workedplot(covs_current_expert_HF_scaled)
GLM predictions to current environment
Model 1
Code
# Predict the presence probability across the entire raster extentpredicted_current_biased <- predicts::predict(covs_current_expert_HF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_current_biased,range =c(0, 1), # Set min and max values for the color scalemain ="Climatic Suitability of Koalas in SEQ (biased)")
Accounting for the sampling bias
What we did above did not account for the sampling bias, as the predictions we generated maintained the bias process. We need to instead set the human footprint index to a constant value across the entire extent, which represents if the process that produces the bias is the same everywhere.
The approach that we’ve taken is not perfect, and would require some thinking and investigation, but it illustrates the potential impact of sampling bias and how you might account for it.
Code
# pull out the scaled human footprint index layerconstant_HF <- covs_current_expert_HF_scaled[[5]]# Set the human footprint index to a constant value across the entire extent# Mean value of the human footprint indexconstant_HF[] <-mean(constant_HF[], na.rm =TRUE)# mask to the extent of the covariatesconstant_HF <- terra::mask(constant_HF, covs_current_expert_HF_scaled[[5]])# add the constant human footprint back into the raster layerscovs_current_expert_constHF_scaled <-c(covs_current_expert_HF_scaled[[1:4]], constant_HF)# check the new human footprint layerplot(covs_current_expert_constHF_scaled)
Generate predictions with the constant human footprint layer
Code
# Predict the presence probability across the entire raster extentpredicted_current <- predicts::predict(covs_current_expert_constHF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_current,range =c(0, 1), # Set min and max values for the color scalemain ="Climatic Suitability of Koalas in SEQ")
Model evaluation with spatial block cross-validation
Code
# Convert training data to sftrain_PB_covs_thinned_sf <-st_as_sf(train_PB_covs_thinned[, c("x", "y", "Presence")], coords =c("x", "y"), crs ="EPSG:7856")# Generate spatial blocksspblock <-cv_spatial(x = train_PB_covs_thinned_sf,column ="Presence",r =NULL,size =50000, # Size of the blocks in metresk =5,hexagon =TRUE,selection ="random",iteration =100, # to find evenly-dispersed foldsbiomod2 =FALSE,progress =FALSE)
# Extract the folds to savespfolds <- spblock$folds_list# We now have a list of 5 folds, where the first object is the training data, and the second is the testing datastr(spfolds)
Typically, it helps to evaluate your model with several metrics that describe different features of model performance and prediction. Here, we define a function to feed in a model prediction and calculate several evaluation metrics.
The metrics are:
Area under the receiver operating characteristic curve (AUC ROC)
Higher values of this (closer to 1) suggest a model is good at distinguishing presence points from the background.
Continuous boyce index
Higher values of this (closer to 1) suggest a model is good at predicting higher suitability at spots where there were presences.
Code
# Start a dataframe to save resultseval_df <-data.frame(fold =numeric(),ROC =numeric(),boyce =numeric())for(f inseq_along(spfolds)) {# Subset the training and testing data (spatial cross validation) (for the fth fold) train_PB_covs_scv <- train_PB_covs_thinned[spfolds[[f]][[1]], ] test_PB_covs_scv <- train_PB_covs_thinned[spfolds[[f]][[2]], ] glm_model_fold <-glm(Presence ~ BIO5_Max_Temp_Warmest_Month + BIO6_Min_Temp_Coldest_Month + BIO12_Annual_Precipitation + BIO15_Precip_Seasonality + human_footprint,data=train_PB_covs_scv,family =binomial(link ="logit"))# Predict to the testing data of fold f test_PB_covs_scv$pred <-predict(glm_model_fold, newdata = test_PB_covs_scv, type ="response")# Evaluate prediction on test set ROC = precrec::auc(precrec::evalmod(scores = test_PB_covs_scv$pred, labels = test_PB_covs_scv$Presence))[1,4] boyce = ecospat::ecospat.boyce(fit = test_PB_covs_scv$pred,obs = test_PB_covs_scv$pred[which(test_PB_covs_scv$Presence==1)],nclass =0, # Calculate continuous indexmethod ="pearson",PEplot = F)[["cor"]]# Add results to dataframe eval_df <- eval_df %>%add_row(fold = f, ROC = ROC, boyce = boyce)}
Summarise the evaluation metrics
Code
# Mean AUC & boyceeval_df %>%summarise(mean_AUC =mean(ROC),mean_boyce =mean(boyce),sd_AUC =sd(ROC),sd_boyce =sd(boyce))
covs_future_SSP370 <- terra::mask(covs_future_SSP370, covs_current) # Crop to SEQ extent using current layerscovs_future_SSP370_expert <-subset(covs_future_SSP370, names(covs_future_SSP370) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))
Test which areas you might mask out because they are ‘novel’ in environmental space and therefore require model extrapolation.
Code
analog_fut <- predicted_currentvalues(analog_fut)[values(mess)<0] <-NAplot(analog_fut,range =c(0, 1), # Set min and max values for the color scalemain ="Koala relative occurrence in regions with analogue conditions")
Code
novel_fut <- predicted_currentvalues(novel_fut)[values(mess)>0] <-NAplot(novel_fut,range =c(0, 1), # Set min and max values for the color scalemain ="Koala relative occurrence in regions with novel conditions")
GLM future predictions
Scale future layers and add human footprint
Code
# Scale the future covariates using the same scaling parameters as the training datacovs_future_SSP370_expert_scaled <-scale(covs_future_SSP370_expert, center =attr(scaled_covariates, "scaled:center")[-5],scale =attr(scaled_covariates, "scaled:scale")[-5])# Add the human footprint layer to the future covariatescovs_future_SSP370_expert_HF_scaled <-c(covs_future_SSP370_expert_scaled, constant_HF)
Model 1
Code
# Predict the presence probability across the entire raster extentpredicted_future370 <-predict(covs_future_SSP370_expert_HF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_future370,range =c(0, 1), # Set min and max values for the color scalemain ="Future Climatic Suitability of Koalas in SEQ - SSP370")
Red is less suitable in the Future SSP370 scenario, and blue is more suitable in the Future SSP370 scenario.
Code
# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future370 - predicted_current,main ="Future (SSP370) - Current Predictions ",col = cols,range =c(-1, 1))
Presenting predictions with uncertainty
There are many sources of model uncertainty that should be explored and ideally, presented alongside model predictions.
One that we’ll focus on here is climate scenario uncertainty. We do so by fitting a second model to future climate data from a lower emission shared socioeconomic path scenario (SSP 1.26).
covs_future_SSP126_expert <-subset(covs_future_SSP126, names(covs_future_SSP126) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))plot(covs_future_SSP126_expert) # to plot the layers
Plot the difference between the two future scenarios
This is SSP370 - SSP126, so positive values indicate that the SSP370 scenario has higher values than the SSP126 scenario.
# Scale the future covariates using the same scaling parameters as the training datacovs_future_SSP126_expert_scaled <-scale(covs_future_SSP126_expert, center =attr(scaled_covariates, "scaled:center")[-5],scale =attr(scaled_covariates, "scaled:scale")[-5])# Add the human footprint layer to the future covariatescovs_future_SSP126_expert_HF_scaled <-c(covs_future_SSP126_expert_scaled, constant_HF)# Predict the presence probability across the entire raster extentpredicted_future126 <-predict(covs_future_SSP126_expert_HF_scaled, glm_model, type ="response")
Plot the different between current and future (SSP126)
Red is less suitable in the Future SSP126 scenario, and blue is more suitable in the Future SSP126 scenario.
Code
# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future126 - predicted_current,main ="Future (SSP370) - Current Predictions ",col = cols,range =c(-1, 1))
Plot the different between the two future predictions
Red is less suitable in the Future SSP370 scenario than the SSP126 scenario, and blue is more suitable in Future SSP370 scenario than the SSP126 scenario.
Code
# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future370 - predicted_future126,main ="Future (SSP370) - Future (SSP126) Predictions ",col = cols,range =c(-1, 1))
Model uncertainty
Another element of uncertainty that can be represented is model uncertainty, or the standard error around the coefficient estimates.
Code
# Extract standard errors of coefficientscoef_se <-summary(glm_model)$coefficients[, "Std. Error"]print(coef_se)
covs_df <-as.data.frame(covs_future_SSP126_expert_HF_scaled, na.rm =FALSE)pred_link <-predict(glm_model, newdata = covs_df, type ="link", se.fit =TRUE)# Linear predictor (eta)eta <- pred_link$fitse_eta <- pred_link$se.fit# Confidence intervals (95%)z <-1.96eta_lower <- eta - z * se_etaeta_upper <- eta + z * se_eta# Transform back to response scalelinkinv <- glm_model$family$linkinvpredicted <-linkinv(eta)lower_ci <-linkinv(eta_lower)upper_ci <-linkinv(eta_upper)# Add to covs_dfcovs_df$predicted <- predictedcovs_df$lower_ci <- lower_cicovs_df$upper_ci <- upper_cipredicted_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), predicted)lower_ci_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), lower_ci)upper_ci_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), upper_ci)# Step 2: Name the layersnames(predicted_r) <-"SSP126 Mean Estimate"names(lower_ci_r) <-"SSP126 Lower CI"names(upper_ci_r) <-"SSP126 Upper CI"prediction_w_uncertainty <-c(predicted_r, lower_ci_r, upper_ci_r)plot(prediction_w_uncertainty, range =c(0, 1))
References
Conn, Paul B, James T Thorson, and Devin S Johnson. 2017. “Confronting Preferential Sampling When Analysing Population Distributions: Diagnosis and Model‐based Triage.”Methods in Ecology and Evolution 8 (11): 1535–46. https://doi.org/10.1111/2041-210x.12803.
Dubos, Nicolas, Clémentine Préau, Maxime Lenormand, Guillaume Papuga, Sophie Monsarrat, Pierre Denelle, Marine Le Louarn, et al. 2022. “Assessing the Effect of Sample Bias Correction in Species Distribution Models.”Ecological Indicators 145 (109487): 109487. https://doi.org/10.1016/j.ecolind.2022.109487.
Kéry, Marc, J Andrew Royle, Hans Schmid, Michael Schaub, Bernard Volet, Guido Häfliger, and Niklaus Zbinden. 2010. “Site-Occupancy Distribution Modeling to Correct Population-Trend Estimates Derived from Opportunistic Observations: Distribution Trends from Opportunistic Observations.”Conservation Biology: The Journal of the Society for Conservation Biology 24 (5): 1388–97. https://doi.org/10.1111/j.1523-1739.2010.01479.x.
Mäkinen, Jussi, Cory Merow, and Walter Jetz. 2024. “Integrated Species Distribution Models to Account for Sampling Biases and Improve Range‐wide Occurrence Predictions.”Global Ecology and Biogeography: A Journal of Macroecology 33 (3): 356–70. https://doi.org/10.1111/geb.13792.
Pennino, Maria Grazia, Iosu Paradinas, Janine B Illian, Facundo Muñoz, José María Bellido, Antonio López-Quílez, and David Conesa. 2019. “Accounting for Preferential Sampling in Species Distribution Models.”Ecology and Evolution 9 (1): 653–63. https://doi.org/10.1002/ece3.4789.
---title: "Species distribution modelling"author: "Scott Forrest and Charlotte Patterson"date: todayexecute: cache: false warning: falsebibliography: paperpile_references.bibtoc: truenumber-sections: falseformat: html: self-contained: true code-fold: show code-tools: true df-print: paged code-line-numbers: true code-overflow: scroll fig-format: png fig-dpi: 300 pdf: geometry: - top=30mm - left=30mmeditor: sourceabstract: | In this script we are fitting species distribution models to the data of koalas (Phascolarctos cinereus) in the South-East Queensland (SEQ) region under current environmental conditions, which we will predict into future environmental conditions. Our modelling protocol follows these steps: 1. Load the koala presence data and environmental covariates. 2. Sample background points. 3. Select environmental covariates based on expert knowledge and correlation analysis. 4. Exploration of the koala presence and background data. 5. Fit a Generalised Linear Model (GLM) to the data based on current climatic conditions. 6. Fit a Random Forest Model to the data based on current climatic conditions. 7. Evaluate the model performance with spatial block cross validation. 8. Predict the model into future environmental conditions. 9. Summarise model uncertainty. ---Up to date references can be downloaded from: https://paperpile.com/eb/EnPQSBmTeb We wrote this script drawing on some of the following resources:- Ecocommons Notebooks <https://www.ecocommons.org.au/notebooks/>- Damaris Zurell's SDM Intro <https://damariszurell.github.io/SDM-Intro/>## Import packages```{r}#| message = FALSE# general R functionslibrary(tidyverse)library(RColorBrewer)# for downloading species datalibrary(galah)# for general spatial operationslibrary(terra)library(sf)library(tidyterra)# for SDM analyseslibrary(predicts)library(blockCV)library(ecospat)library(usdm)library(precrec)library(corrplot)```## Load South East Queensland (SEQ) boundaryWe start by defining our study area, which is the South East Queensland (SEQ) region. In a previous session we used the Local Government Areas (LGA) shapefile to define the extent of SEQ.What we'll load is a polygon of our boundary which becomes our model domain.For our coordinate reference system throughout, we'll be using GDA2020 / MGA zone 56, which is identified by the EPSG code 7856.```{r}SEQ_extent.vect <-vect("Data/Environmental_variables/seq_boundary.gpkg")# Define an sf object as well (used later)SEQ_extent <-st_as_sf(SEQ_extent.vect, coords =c("x", "y"), crs =7856)plot(SEQ_extent.vect)```# Step 1. Load the koala presence data and environmental covariates.First, we're going to get some koala occurrence data. We will use the `galah` package to access the Atlas of Living Australia (ALA) data. The ALA is a great resource for Australian biodiversity data, and the `galah` package provides a convenient interface to access it.We won't have time to explore these, but there are plenty of options for those using data outside of Australia. For example, the `rgbif` package provides access to the Global Biodiversity Information Facility (GBIF) data, and the `spocc` package provides access to a range of biodiversity data sources.### To cite the `galah` package```{r}# Package citationcitation(package ="galah")```### Atlas of Living Australia using "galah" packageTo access records from the ALA, you will need to be registered.There are two registration options:1. If you are affiliated with an Australian institution, you should be able to register for ALA through your institution. Find your institution in the list of institutions when you select 'Login' [on the ALA website](https://www.ala.org.au/). Follow the prompts to enter your institution email and password. If your institution is not listed, you will need to register for an ALA account.2. If you are not affiliated with an Australian institution, you will need to register for an ALA account. You can do this by selecting 'Register' [on the ALA website](https://www.ala.org.au/). Follow the prompts to enter your email and password.Once you're registered, you can set up your galah access configuration. This is a one-time setup that allows you to access the ALA data through the `galah` package. For now, we'll just use Scott's email address, but you should use your own email address once you've registered for an ALA account.```{r}galah_config(atlas ="ALA",email ="scott.forrest@hdr.qut.edu.au" )```### Download koala data from ALAWe're using a species name call with a state filter to identify all Presence-Only occurrence records of koala for Queensland.To prevent calling from the ALA every time we run this script, we will save the data to a CSV file. You can uncomment the code below to download the data again if you need to.```{r}# koala_occurrences <- galah_call() %>% # galah_identify("Phascolarctos cinereus") %>% # galah_filter(# stateProvince == "Queensland",# occurrenceStatus == "PRESENT"# ) %>% # atlas_occurrences()# save the csv# write_csv(koala_occurrences, "Data/Biological_records/koala_occurrences.csv")koala_occurrences <-read_csv("Data/Biological_records/koala_occurrences.csv")```This download takes a minute or so, and we end up with a dataframe of 93456 records of koala!```{r}nrow(koala_occurrences)names(koala_occurrences)```### Cleaning koala dataNot all data are created equal, and it's always important to spend some time examining and cleaning your species occurrence data before using it in a model. This is particularly relevant for us because we're using records collated into a database from a bunch of different sources. We don't have time to explore this today, but here's a few examples of things you might check at this stage and some functions / packages available to help:- When downloading data from the ALA or directly from GBIF, you can filter for specific fields. For example, you can filter for records with a specific coordinate precision or quality, or remove those records older than a certain date. You can use the `galah_filter()` function for this.- Check for duplicates in the data. You can use the `dplyr` package with functions like `distinct()`.- Common issues like spatial outliers, taxonomic errors or coordinate errors are tested for with packages like: `spocc`, `CoordinateCleaner`, and `bdc`.- Usually you would check that the CRS for the occurrences is the same as the shapefile. However, we know that the \`galah' package operates in WGS84.```{r}# Check for missing values in decimalLongitude and decimalLatitudepaste0("Number of NAs in 'longitude' ", sum(is.na(koala_occurrences$decimalLongitude)))paste0("Number of NAs in 'latitude' ", sum(is.na(koala_occurrences$decimalLatitude)))koala_occ_sf <- koala_occurrences %>% tidyr::drop_na(decimalLongitude, decimalLatitude) %>%# remove NA valuesst_as_sf(coords =c("decimalLongitude", "decimalLatitude"), crs =4326) %>%st_transform(7856) # Transform to the same CRS as SEQ_extent```### Plot the koala data across Queensland```{r}# Load a shapefile for Queensland just for plottingqld_shp =st_read('Data/Environmental_variables/QLD_State_Mask.shp') %>%st_transform(7856) # Transform to the same CRS as SEQ_extent# Create a map using ggplot2ggplot() +geom_sf(data = qld_shp, color ="black") +geom_sf(data = koala_occ_sf,color ="blue", size =0.5) +# Add points for occurrencesggtitle("Koala occurrences across Queensland") +# Add titletheme_bw()```### Filter by SEQ regionThis takes a minute or two.```{r}koala_occ_sf <- koala_occ_sf %>%st_intersection(SEQ_extent) %>%# Mask to extentdistinct() # drop any duplicated records```### Plot the filtered koala data```{r}ggplot() +geom_sf(data = SEQ_extent, fill ="purple3", alpha =0.5, color ="black", size =0.2) +geom_sf(data = koala_occ_sf, # Add koala presence locationsaes(geometry = geometry),color ="blue", size =0.5) +# Add points for occurrencesggtitle("Koala occurrences in South East Queensland") +# Add titletheme_bw()```# Step 2. Sample background pointsChoosing background points to sample the availability of different environmental conditions is an important step in presence-only modelling. These points are contrasted against environmental conditions where your species was found (the presences) to help the model learn what conditions are suitable for the species. Background selection is a critical step in presence-only SDMs. Choices reflect your understanding of your study species. There's lots of good discussion about approaches to background selection in the literature, and we recommend reading some of these papers to understand the implications of your choices.For this tutorial, we will use random sampling of background points across the SEQ region to keep it simple.A few other approaches include:- **Buffering**: Create a buffer around the presence points and sample points within that buffer. Figure from [Velazco et al.](https://sjevelazco.github.io/flexsdm/articles/v01_pre_modeling.html#background-and-pseudo-absence-sampling).[Buffered background locations](Figures/Buffered_Background.png)```{r}#| echo: false# knitr::include_graphics("Figures/Buffered_Background.png")```- **Minimum convex hull**: Create a minimum convex hull around the presence points and sample points within that hull. Figure from [Velazco et al.](https://sjevelazco.github.io/flexsdm/articles/v01_pre_modeling.html#background-and-pseudo-absence-sampling).[Minimum convex hull background locations](Figures/Minimum_convex_polygon_Background.png)```{r}#| echo: false# knitr::include_graphics("Figures/Minimum_convex_polygon_Background.png")```## Random background samplingWe need to load a template grid for creating background points. This matches the grid our covariates are on, which we'll load soon.We're using a function from the `predicts` package for this.```{r}domain <-rast("Data/Environmental_variables/SEQ_current_bioclim.tif")# Make a blank raster of all 1s of the same dimensions as our covariate griddomain <- domain[[1]]domain <-ifel(is.na(domain), NA, 1) # Set all values to 1 except for NA values (which are outside the SEQ extentnames(domain) <-"SEQ_extent"plot(domain, main ="SEQ extent for background sampling")# Set the location and number of background points# The mask means that any NA (not SEQ) locations do not include background pointsbackground <- predicts::backgroundSample(mask = domain, n =2500)# Convert to terra SpatVector objectbackground <- terra::vect(background[,1:2], crs ="EPSG:7856")# Convert background points (SpatVector) to data framebackground_df <-as.data.frame(geom(background))koala_occ.vect <-vect(koala_occ_sf)# Plot the presences (blue) and the background points (grey)ggplot() +geom_sf(data = SEQ_extent, fill ="purple3", alpha =0.5, color ="black", size =0.2) +geom_spatvector(data = background, # Add koala presence locationscolor ="gray", cex =0.5) +# Add points for occurrencesgeom_spatvector(data = koala_occ.vect, aes(geometry = geometry), color ="blue", cex =0.5) +# Add background pointsggtitle("Koala occurrences (blue) and background points (grey) in South East Queensland") +# Add titletheme_bw()```### Combine koala presence and background points as dataframes```{r}# Make a dataframe of just x, y and presencekoala_occ_df <- koala_occ_sf %>% dplyr::mutate(x = sf::st_coordinates(.)[,1],y = sf::st_coordinates(.)[,2]) %>%st_drop_geometry() %>% dplyr::select(x,y) %>%mutate(Presence =1)head(koala_occ_df)background_df <- background %>%as.data.frame(geom ="XY") %>% dplyr::select(x,y) %>%mutate(Presence =0)head(background_df)# Combine to one dataframepr_bg <-rbind(koala_occ_df, background_df)```# Step 3. Environmental covariate selectionFirst, we load the rasters describing the current environmental conditions. We did some pre-formatting of these rasters so they match the koala data in projection and extent.Layers were made available to us by the EcoCommons team and were created by Toombs and Ma (2025):Toombs, N., and Ma S., 2025, A High-Resolution Dataset of 19 Bioclimatic Indices over Australia, Climate Projections and Services – Queensland Treasury, Brisbane, Queensland. \[https://longpaddock.qld.gov.au/qld-future-climate/data-info/tern/\]```{r}covs_current <-rast("Data/Environmental_variables/SEQ_current_bioclim.tif")# Define the BIOCLIM names for the raster layerslayer_names <-c("BIO1_Annual_Mean_Temp","BIO2_Mean_Diurnal_Temp_Range","BIO3_Isothermality","BIO4_Temperature_Seasonality","BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO7_Temperature_Annual_Range","BIO8_Mean_Temp_Wettest_Quarter","BIO9_Mean_Temp_Driest_Quarter","BIO10_Mean_Temp_Warmest_Quarter","BIO11_Mean_Temp_Coldest_Quarter","BIO12_Annual_Precipitation","BIO13_Precip_Wettest_Month","BIO14_Precip_Driest_Month","BIO15_Precip_Seasonality","BIO16_Precip_Wettest_Quarter","BIO17_Precip_Driest_Quarter","BIO18_Precip_Warmest_Quarter","BIO19_Precip_Coldest_Quarter")names(covs_current) <- layer_names```### One approach: Narrow down potential covariates based on ecological or expert knowledgeFor this example, we had advice from scientists at [CSIRO](https://www.csiro.au/en/) who conducted an expert elicitation to gather a set of potential covariates that are likely to be important for koalas.This is an example of one approach for deciding what covariates are candidates for inclusion in your model. We use this expert knowledge to filter out the key bioclim variables.We select the following:- Bio5 : Max temp of the warmest month (mainly for the northern populations)- Bio6 : Min temp of the coldest month (mainly for southern populations, which essentially excludes alpine regions)- Bio12 : Annual Precipitation- Bio15 : Precipitation seasonality (coefficient of variation).#### Plotting our four covariates from expert elicitation```{r}# select only the covariates we are using by their namescovs_current_expert <-subset(covs_current, names(covs_current) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))for(i in1:nlyr(covs_current_expert)) {plot(covs_current_expert[[i]], main =names(covs_current_expert)[[i]])}```## Covariate to account for sampling bias### CHARLOTTE PLEASE CHECK PARAGRAPH, and add any relevant refs ###As these data were collected from a wide number of sources, we have to consider whether any of the data, such as that which was opportunistically collected [@Kery2010-aa], might be biased towards areas of higher human activity. This is a common issue in species occurrence data and is typpically called smapling bias or preferential sampling, and occurs because areas that are more accessible or have higher human activity are more likely to be opportunistically sampled [@Conn2017-lh].This means that if there are environmental variables that happen correlated with human activity (which is often the case, humans also select for environmental features), it will appear that these variables are more influential than they actually are, because we associating that environmental covariate with 'more' presence records.There are several different ways to account for this when modelling [@Pennino2019-sy; @Makinen2024-th; @Dubos2022-ye]. Some of the common approaches include:- Sampling pseudo-absences where there are more points where there is higher sampling effort- Using a covariate to attempt to capture the sampling bias, such as human population density or distance to roads- Using a model or model structure that explicitly accounts for sampling bias, such as including a spatial random effectIn our case, we have will use a covariate for the [Global Human Footprint Index (HFP)](https://hub.arcgis.com/maps/65518e782be04e7db31de65d53d591a9/about) , which is a measure of human impact on the environment. This is a raster layer with continuous (percentage values) that we will use to account for sampling bias in our model.You will be able to see Brisbane as the largest area of human footprint, Moreton Bay to the right, the Gold Coast to the south, the Sunshine Coast to the north, and Toowoomba to the west (about 2 hours drive).```{r}human_footprint <-rast("Data/Environmental_variables/cost_hfp2013.tif")names(human_footprint) <-"human_footprint"# Update CRS to EPSG:7856 (GDA2020 / MGA zone 56)human_footprint <- terra::project(human_footprint, "EPSG:7856")# Resample to match the extent and resolution of the covariateshuman_footprint <-resample(human_footprint, covs_current_expert, method ="bilinear")# Mask to SEQ extenthuman_footprint <- terra::mask(human_footprint, covs_current_expert[[1]]) plot(human_footprint, main ="Human Footprint Index")```Although this isn't a thorough comparison, we can see that this variable is correlated with the presence of koalas.```{r}plot(human_footprint, main ="Human Footprint Index")points(koala_occ_sf, col ="red", alpha =0.1, pch =20, cex =0.5)```You would want to put some more thought into your own study and the best way to account for sampling bias, but in our case we will try using the human footprint index as a covariate in our model, and assess how our predictions change.### Add the human footprint layer to our covariates```{r}covs_current_expert_HF <-c(covs_current_expert, human_footprint)plot(covs_current_expert_HF)```### Extract environmental covariate values from presence and background locations (training locations)This will create a dataframe where each row is a koala presence or background location and each column is a value for our four bioclimatic variables at that location.We use a function from `terra` for this.```{r}# this will give us the covariate values for each presence and background pointPB_covs <- terra::extract(x = covs_current_expert_HF, # Raster with environmental variablesy = pr_bg[, c("x", "y")], # Dataframe with x, y and presenceID =FALSE# Don't return an ID column for each location)train_PB_covs <-cbind(pr_bg, PB_covs) # Combine the presence column with the covariate values# drop any NAstrain_PB_covs <- train_PB_covs %>%drop_na()head(train_PB_covs)```## Thin the koala presence points (for tutorial only)We now thin the presences to reduce the number of points to a manageable size for plotting and modelling. *This is just for the purpose of this tutorial*, and is done here to make the tutorial run faster and to make the plots clearer. There are some reasons you would want to thin (such as bias correction), but you would want to carefully consider whether this is appropriate for your modelling.We only thin the presence points as the background points are already limited by the number of cells in the SEQ region.```{r}train_PB_covs_pres <- train_PB_covs %>%filter(Presence ==1)train_PB_covs_bg <- train_PB_covs %>%filter(Presence ==0)# Thin the presences - 10000 presence pointstrain_PB_covs_pres_thinned <- train_PB_covs_pres[sample(nrow(train_PB_covs_pres), 10000), ]# Combine back into both presence and backgroundtrain_PB_covs_thinned <-rbind(train_PB_covs_pres_thinned, train_PB_covs_bg, make.row.names =FALSE)```## Check correlation and multicollinearity of covariatesAlthough we've narrowed down our covariates already based on discussion with experts, there's a chance that our remaining variables might be correlated or collinear. This would cause issues with models, particularly because we're hoping to predict our model into the future. We're going to inspect our covariates for signs of correlation and multicolinearity.There are several different methods for creating correlation plots, we just show two.### Correlation plot from the ecospat package```{r}# use the dataframe we just created, dropping the x, y, and presence columns (1-3)ecospat::ecospat.cor.plot(train_PB_covs_thinned[, -1:-3])```What the correlation test plot suggests is that we have some correlated covariates. Particularly our Bio6 and Bio12.Often a rule of thumb of \~ 0.7 correlation is used to decide what a 'too correlated' covariate pair are (REF). In practice, it's important to think carefully before dropping covariates.### Variance Inflation Factor (VIF)If you find corrplot is hard for you to use to make decisions, we can also look at the Variance Inflation Factor (VIF). VIF is another statistical measure used to detect multicollinearity in a set of explanatory (independent) variables in a regression model.**Interpretation:**- VIF = 1: No correlation- VIF \> 1 and \<= 5: Moderate correlation; may not require corrective action.- VIF \> 5: Indicates high correlation. Multicollinearity may be problematic, and further investigation is recommended.- VIF \> 10: Strong multicollinearity. The variable is highly collinear with others, and steps should be taken to address this.```{r}usdm::vifstep(covs_current_expert_HF) # Variance Inflation Factor and test for multicollinearity```# Step 4. Exploration of the koala presence and background dataIt is good practice to assess where in the environmental space the presence and background points are located. This can help to identify if there are any potential issues with the data, such as a lack of background points in certain areas of environmental space. It's also a good way to have a first look at any patterns in the species data and the environmental covariates.::: panel-tabset## Bio5 Max. temperature warmest month```{r}ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO5_Max_Temp_Warmest_Month"]], fill =as.factor(Presence)),alpha =0.5,bw =0.25) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO5_Max_Temp_Warmest_Month",fill ="Presence") +theme(legend.position ="bottom")```## Bio6 Min. temperature coldest month```{r}ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO6_Min_Temp_Coldest_Month"]], fill =as.factor(Presence)),alpha =0.5,bw =0.25) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO6_Min_Temp_Coldest_Month",fill ="Presence") +theme(legend.position ="bottom")```## Bio12 Annual precipitation```{r}ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO12_Annual_Precipitation"]], fill =as.factor(Presence)),alpha =0.5,bw =50) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO12_Annual_Precipitation",fill ="Presence") +theme(legend.position ="bottom")```## Bio15 Precipitation seasonality```{r}ggplot() +geom_density(data = train_PB_covs_thinned,aes(x = .data[["BIO15_Precip_Seasonality"]], fill =as.factor(Presence)),alpha =0.5,bw =1) +# we can try different bandwidth values to smooth the densitiestheme_bw() +labs(title ="BIO15_Precip_Seasonality",fill ="Presence") +theme(legend.position ="bottom")```:::## Scaling the covariatesIt is common to scale the covariates before fitting a model. This is because some models, such as Generalised Linear Models (GLMs), can be sensitive to the scale of the covariates. Scaling the covariates can also make the coefficients more interpretable comparable, as they are now on similar scales to each other.A typical scaling approach is to subtract the mean and divide by the standard deviation of each covariate, which is also known as z-scaling, and after this operation, all covariates will have a mean of 0 and a standard deviation of 1.We can use base R to scale the covariates, and the default is to centre (subtract the mean) and scale (divide by the standard deviation).```{r}# pull out just the covariate values from the dataframecovariate_values <- train_PB_covs_thinned[, -1:-3]# scale the covariates - returns a matrixscaled_covariates <- base::scale(covariate_values, center =TRUE, scale =TRUE)# convert back to a dataframe (and remove row names)scaled_covariates_df <-data.frame(scaled_covariates)head(scaled_covariates_df)# Add the presence column back to the scaled covariatestrain_PB_covs_thinned_scaled <-cbind(train_PB_covs_thinned[, 1:3], scaled_covariates_df)```The scaling operation also returns the particular scaling values that were used, which is helpful for generating predictions later on. We can access these using the attributes of the returned dataframe. We only need the scaling factor, as we will rescale the coefficients before generating predictions.```{r}# Get the attributes of the scaled covariatesscaled_attributes <-attributes(scaled_covariates)scaled_attributes$`scaled:scale`# the SD/scaling factor of each covariate```# Step 5. Fit a model: A generalised linear model (GLM)```{r}# Make a folder to save outputsdir.create("Outputs/GLM_outputs", showWarnings = F)```## Null modelNull model: no explanatory variables or predictors are included.It is always helpful to create a null model as a benchmark to assess how the inclusion of explanatory variables improves the model.```{r}# Fit a null model with only the interceptnull_model <-glm(Presence ~1,data = train_PB_covs_thinned_scaled,family =binomial(link ="logit"))# Check the model resultssummary(null_model)```## GLM - expert variables with linear termsIn this model, we include linear terms for the covariates. You can also do things like add quadratic terms to account for non-linear relationships between the predictors and the response variable. This increases the complexity of the model and allows for more flexibility in fitting the data.You can also try fitting the model with and without the scaled covariates. The z- and p-values of each covariate should be the same (but not for the intercept and centering the covariates affects that), as well as the AIC.```{r}glm_model <-glm(Presence ~ BIO5_Max_Temp_Warmest_Month + BIO6_Min_Temp_Coldest_Month + BIO12_Annual_Precipitation + BIO15_Precip_Seasonality + human_footprint,data=train_PB_covs_thinned_scaled,family =binomial(link ="logit"))# Check the model resultssummary(glm_model)```## Model effect evaluationHere we use a function presented in an EcoCommons Australia notebook to evaluate the model performance. The notebook can be found on their GitHub: https://github.com/EcoCommonsAustralia/notebooks/tree/main/notebooks.```{r}#| code-fold: true# Function to plot effect size graphplot_effect_size <-function(glm_model) {# Check if required libraries are installedif (!requireNamespace("ggplot2", quietly =TRUE)) {stop("Please install the 'ggplot2' package to use this function.") }library(ggplot2)# Extract effect sizes (coefficients) from the model coefs <-summary(glm_model)$coefficients effect_sizes <-data.frame(Variable =rownames(coefs)[-1], # Exclude the interceptEffect_Size = coefs[-1, "Estimate"],Std_Error = coefs[-1, "Std. Error"] )# Sort by effect size effect_sizes <- effect_sizes[order(-abs(effect_sizes$Effect_Size)), ]# Plot the effect sizes with error barsggplot(effect_sizes, aes(x =reorder(Variable, Effect_Size), y = Effect_Size)) +geom_hline(yintercept =0, linetype ="dashed", color ="black", alpha =0.5) +geom_point(stat ="identity", colour ="#11aa96") +geom_errorbar(aes(ymin = Effect_Size - Std_Error, ymax = Effect_Size + Std_Error), colour ="#11aa96", width =0.1) +coord_flip() +labs(title ="Effect Sizes of Variables",x ="Variable",y ="Effect Size (Coefficient Estimate)" ) +theme_bw()}```Run the function on the covariates used in the model.Interestingly, the human footprint index has the largest effect size, followed by the annual precipitation and the minimum temperature of the coldest month, both of which were positive. The maximum temperature of the warmest month has a negative coefficient, which suggests that koalas are less likely to be found in areas with higher temperatures during the warmest month.```{r}plot_effect_size(glm_model)```## Response curvesAgain, we can use a function from the EcoCommons notebook to plot the response curves from the model.```{r}#| code-fold: trueplot_species_response <-function(glm_model, predictors, data) {# Check if required libraries are installedif (!requireNamespace("ggplot2", quietly =TRUE) ||!requireNamespace("gridExtra", quietly =TRUE)) {stop("Please install the 'ggplot2' and 'gridExtra' packages to use this function.") }library(ggplot2)library(gridExtra)# Create empty list to store response plots response_plots <-list()# Loop through each predictor variablefor (predictor in predictors) {# Create new data frame to vary predictor while keeping others constant pred_range <-seq(min(data[[predictor]], na.rm =TRUE),max(data[[predictor]], na.rm =TRUE),length.out =100 ) const_data <- data[1, , drop =FALSE] # Use first row to keep other predictors constant response_data <- const_data[rep(1, 100), ] # Duplicate the row response_data[[predictor]] <- pred_range# Predict probabilities predicted_response <-predict(glm_model, newdata = response_data, type ="response")# Create data frame for plotting plot_data <-data.frame(Predictor_Value = pred_range,Predicted_Probability = predicted_response )# Add presence and absence data presence_absence_data <-data.frame(Predictor_Value = data[[predictor]],Presence_Absence = data$Presence )# Generate the response plot p <-ggplot() +geom_line(data = plot_data,aes(x = Predictor_Value, y = Predicted_Probability),color ="#61c6fa", linewidth =1) +geom_point(data = presence_absence_data[presence_absence_data$Presence_Absence ==1, ],aes(x = Predictor_Value, y = Presence_Absence),color ="#11aa96", alpha =0.2) +geom_point(data = presence_absence_data[presence_absence_data$Presence_Absence ==0, ],aes(x = Predictor_Value, y = Presence_Absence),color ="#f6aa70", alpha =0.2) +labs(x = predictor, y =NULL) +theme_bw() +theme(axis.title.y =element_blank())# Store the plot in the list response_plots[[predictor]] <- p }# Arrange all plots in one combined plot with a single shared y-axis labelgrid.arrange(grobs = response_plots,ncol =3,left ="Predicted Probability / Presence-Absence" )}```### Plot the response curves```{r}# get the names of the covariates in the model (drop the intercept)predictors <-names(glm_model$coefficients)[-1]# Plot the response curves for the predictors in the modelplot_species_response(glm_model, predictors, train_PB_covs_thinned_scaled)```### Scale the raster layers with the same scaling as the covariates```{r}# Scale your prediction rasters using the same scaling parameterscovs_current_expert_HF_scaled <-scale(covs_current_expert_HF, center =attr(scaled_covariates, "scaled:center"),scale =attr(scaled_covariates, "scaled:scale"))# plot to check the scaling workedplot(covs_current_expert_HF_scaled)```## GLM predictions to current environment### Model 1```{r}# Predict the presence probability across the entire raster extentpredicted_current_biased <- predicts::predict(covs_current_expert_HF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_current_biased,range =c(0, 1), # Set min and max values for the color scalemain ="Climatic Suitability of Koalas in SEQ (biased)")```## Accounting for the sampling biasWhat we did above did not account for the sampling bias, as the predictions we generated maintained the bias process. We need to instead set the human footprint index to a constant value across the entire extent, which represents if the process that produces the bias is the same everywhere.The approach that we've taken is not perfect, and would require some thinking and investigation, but it illustrates the potential impact of sampling bias and how you might account for it.```{r}# pull out the scaled human footprint index layerconstant_HF <- covs_current_expert_HF_scaled[[5]]# Set the human footprint index to a constant value across the entire extent# Mean value of the human footprint indexconstant_HF[] <-mean(constant_HF[], na.rm =TRUE)# mask to the extent of the covariatesconstant_HF <- terra::mask(constant_HF, covs_current_expert_HF_scaled[[5]])# add the constant human footprint back into the raster layerscovs_current_expert_constHF_scaled <-c(covs_current_expert_HF_scaled[[1:4]], constant_HF)# check the new human footprint layerplot(covs_current_expert_constHF_scaled)```Generate predictions with the constant human footprint layer```{r}# Predict the presence probability across the entire raster extentpredicted_current <- predicts::predict(covs_current_expert_constHF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_current,range =c(0, 1), # Set min and max values for the color scalemain ="Climatic Suitability of Koalas in SEQ")```# Model evaluation with spatial block cross-validation```{r}# Convert training data to sftrain_PB_covs_thinned_sf <-st_as_sf(train_PB_covs_thinned[, c("x", "y", "Presence")], coords =c("x", "y"), crs ="EPSG:7856")# Generate spatial blocksspblock <-cv_spatial(x = train_PB_covs_thinned_sf,column ="Presence",r =NULL,size =50000, # Size of the blocks in metresk =5,hexagon =TRUE,selection ="random",iteration =100, # to find evenly-dispersed foldsbiomod2 =FALSE,progress =FALSE)cv_plot(cv = spblock,x = train_PB_covs_thinned_sf,points_alpha =0.5,nrow =2)# Extract the folds to savespfolds <- spblock$folds_list# We now have a list of 5 folds, where the first object is the training data, and the second is the testing datastr(spfolds)```## Run the model for every fold and evaluate### Model evaluation - metricsTypically, it helps to evaluate your model with several metrics that describe different features of model performance and prediction. Here, we define a function to feed in a model prediction and calculate several evaluation metrics.The metrics are:**Area under the receiver operating characteristic curve (AUC ROC)**- Higher values of this (closer to 1) suggest a model is good at distinguishing presence points from the background.**Continuous boyce index**- Higher values of this (closer to 1) suggest a model is good at predicting higher suitability at spots where there were presences.```{r}# Start a dataframe to save resultseval_df <-data.frame(fold =numeric(),ROC =numeric(),boyce =numeric())for(f inseq_along(spfolds)) {# Subset the training and testing data (spatial cross validation) (for the fth fold) train_PB_covs_scv <- train_PB_covs_thinned[spfolds[[f]][[1]], ] test_PB_covs_scv <- train_PB_covs_thinned[spfolds[[f]][[2]], ] glm_model_fold <-glm(Presence ~ BIO5_Max_Temp_Warmest_Month + BIO6_Min_Temp_Coldest_Month + BIO12_Annual_Precipitation + BIO15_Precip_Seasonality + human_footprint,data=train_PB_covs_scv,family =binomial(link ="logit"))# Predict to the testing data of fold f test_PB_covs_scv$pred <-predict(glm_model_fold, newdata = test_PB_covs_scv, type ="response")# Evaluate prediction on test set ROC = precrec::auc(precrec::evalmod(scores = test_PB_covs_scv$pred, labels = test_PB_covs_scv$Presence))[1,4] boyce = ecospat::ecospat.boyce(fit = test_PB_covs_scv$pred,obs = test_PB_covs_scv$pred[which(test_PB_covs_scv$Presence==1)],nclass =0, # Calculate continuous indexmethod ="pearson",PEplot = F)[["cor"]]# Add results to dataframe eval_df <- eval_df %>%add_row(fold = f, ROC = ROC, boyce = boyce)}```### Summarise the evaluation metrics```{r}# Mean AUC & boyceeval_df %>%summarise(mean_AUC =mean(ROC),mean_boyce =mean(boyce),sd_AUC =sd(ROC),sd_boyce =sd(boyce))```# Make predictions to future climates## Load future environmental data```{r}covs_future_SSP370 <-rast("Data/Environmental_variables/SEQ_future_bioclim.2090.SSP370.tif")names(covs_future_SSP370) <- layer_namescovs_future_SSP370covs_future_SSP370 <- terra::mask(covs_future_SSP370, covs_current) # Crop to SEQ extent using current layerscovs_future_SSP370_expert <-subset(covs_future_SSP370, names(covs_future_SSP370) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))```## Plot the future rasters```{r}plot(covs_future_SSP370_expert)```### Compare the current and future rasters```{r}plot(covs_future_SSP370_expert - covs_current_expert)```## Test the environmental distance between current data and future conditions```{r}mess <- predicts::mess(covs_future_SSP370_expert, train_PB_covs_thinned[, c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality")])plot(mess)r_mess_mask <- mess <0plot(r_mess_mask)```Test which areas you might mask out because they are 'novel' in environmental space and therefore require model extrapolation.```{r}analog_fut <- predicted_currentvalues(analog_fut)[values(mess)<0] <-NAplot(analog_fut,range =c(0, 1), # Set min and max values for the color scalemain ="Koala relative occurrence in regions with analogue conditions")novel_fut <- predicted_currentvalues(novel_fut)[values(mess)>0] <-NAplot(novel_fut,range =c(0, 1), # Set min and max values for the color scalemain ="Koala relative occurrence in regions with novel conditions")```# GLM future predictions## Scale future layers and add human footprint```{r}# Scale the future covariates using the same scaling parameters as the training datacovs_future_SSP370_expert_scaled <-scale(covs_future_SSP370_expert, center =attr(scaled_covariates, "scaled:center")[-5],scale =attr(scaled_covariates, "scaled:scale")[-5])# Add the human footprint layer to the future covariatescovs_future_SSP370_expert_HF_scaled <-c(covs_future_SSP370_expert_scaled, constant_HF)```## Model 1```{r}# Predict the presence probability across the entire raster extentpredicted_future370 <-predict(covs_future_SSP370_expert_HF_scaled, glm_model, type ="response")# Plot the species distribution rasterplot( predicted_future370,range =c(0, 1), # Set min and max values for the color scalemain ="Future Climatic Suitability of Koalas in SEQ - SSP370")```## Show the predictions side-by-side```{r}par(mfrow =c(1, 2))plot( predicted_current,range =c(0, 1),main ="Current")plot( predicted_future370,range =c(0, 1),main ="Future (SSP 3.70)")```## Plot the different between current and futureRed is **less** suitable in the Future SSP370 scenario, and blue is **more** suitable in the Future SSP370 scenario.```{r}# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future370 - predicted_current,main ="Future (SSP370) - Current Predictions ",col = cols,range =c(-1, 1))```## Presenting predictions with uncertaintyThere are many sources of model uncertainty that should be explored and ideally, presented alongside model predictions.One that we'll focus on here is climate scenario uncertainty. We do so by fitting a second model to future climate data from a lower emission shared socioeconomic path scenario (SSP 1.26).## Load future environmental data (SSP 1.26)```{r}covs_future_SSP126 <-rast("Data/Environmental_variables/SEQ_future_bioclim.2090.SSP126.tif")names(covs_future_SSP126) <- layer_namescovs_future_SSP126covs_future_SSP126_expert <-subset(covs_future_SSP126, names(covs_future_SSP126) %in%c("BIO5_Max_Temp_Warmest_Month","BIO6_Min_Temp_Coldest_Month","BIO12_Annual_Precipitation","BIO15_Precip_Seasonality"))plot(covs_future_SSP126_expert) # to plot the layers```### Plot the difference between the two future scenariosThis is SSP370 - SSP126, so positive values indicate that the SSP370 scenario has higher values than the SSP126 scenario.```{r}terra::plot(covs_future_SSP370_expert - covs_future_SSP126_expert)```## GLM future predictions (SSP 1.26)### Model 1```{r}# Scale the future covariates using the same scaling parameters as the training datacovs_future_SSP126_expert_scaled <-scale(covs_future_SSP126_expert, center =attr(scaled_covariates, "scaled:center")[-5],scale =attr(scaled_covariates, "scaled:scale")[-5])# Add the human footprint layer to the future covariatescovs_future_SSP126_expert_HF_scaled <-c(covs_future_SSP126_expert_scaled, constant_HF)# Predict the presence probability across the entire raster extentpredicted_future126 <-predict(covs_future_SSP126_expert_HF_scaled, glm_model, type ="response")```## Plot the different between current and future (SSP126)Red is **less** suitable in the Future SSP126 scenario, and blue is **more** suitable in the Future SSP126 scenario.```{r}# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future126 - predicted_current,main ="Future (SSP370) - Current Predictions ",col = cols,range =c(-1, 1))```## Compare the two future predictions```{r}par(mfrow =c(1, 2))# Plot the predicted suitability plot( predicted_future126,range =c(0,1),main ="Suitability SSP126")plot( predicted_future370,range =c(0,1),main ="Suitability SSP370")```## Plot the different between the two future predictionsRed is **less** suitable in the Future SSP370 scenario than the SSP126 scenario, and blue is **more** suitable in Future SSP370 scenario than the SSP126 scenario.```{r}# return to single plottingpar(mfrow =c(1, 1))# Create symmetric breaks centered at 0breaks <-seq(-1, 1, length.out =101)# Create the color palettecols <-colorRampPalette(c("red", "white", "blue"))(100)plot(predicted_future370 - predicted_future126,main ="Future (SSP370) - Future (SSP126) Predictions ",col = cols,range =c(-1, 1))```## Model uncertaintyAnother element of uncertainty that can be represented is model uncertainty, or the standard error around the coefficient estimates.```{r}# Extract standard errors of coefficientscoef_se <-summary(glm_model)$coefficients[, "Std. Error"]print(coef_se)``````{r}covs_df <-as.data.frame(covs_future_SSP126_expert_HF_scaled, na.rm =FALSE)pred_link <-predict(glm_model, newdata = covs_df, type ="link", se.fit =TRUE)# Linear predictor (eta)eta <- pred_link$fitse_eta <- pred_link$se.fit# Confidence intervals (95%)z <-1.96eta_lower <- eta - z * se_etaeta_upper <- eta + z * se_eta# Transform back to response scalelinkinv <- glm_model$family$linkinvpredicted <-linkinv(eta)lower_ci <-linkinv(eta_lower)upper_ci <-linkinv(eta_upper)# Add to covs_dfcovs_df$predicted <- predictedcovs_df$lower_ci <- lower_cicovs_df$upper_ci <- upper_cipredicted_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), predicted)lower_ci_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), lower_ci)upper_ci_r <-setValues(rast(covs_future_SSP126_expert_HF_scaled, nlyr =1), upper_ci)# Step 2: Name the layersnames(predicted_r) <-"SSP126 Mean Estimate"names(lower_ci_r) <-"SSP126 Lower CI"names(upper_ci_r) <-"SSP126 Upper CI"prediction_w_uncertainty <-c(predicted_r, lower_ci_r, upper_ci_r)plot(prediction_w_uncertainty, range =c(0, 1))```# References::: {#refs}:::# Session information```{r}sessionInfo()```